BTS 510 Lab 9

set.seed(12345)
library(tidyverse)
Warning: package 'purrr' was built under R version 4.5.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(Stat2Data)
theme_set(theme_classic(base_size = 16))

1 Learning objectives

  • Describe maximum likelihood estimation for linear regression
  • Describe hypothesis testing for linear regression coefficients, including the sampling distribution used and degrees of freedom

2 Data

  • FirstYearGPA data from the Stat2Data package: n = 219 subjects
    • GPA: First-year college GPA on a 0.0 to 4.0 scale
    • HSGPA: High school GPA on a 0.0 to 4.0 scale
    • SATV: Verbal/critical reading SAT score
    • SATM: Math SAT score
    • Male: 1= male, 0= female
    • HU: Number of credit hours earned in humanities courses in high school
    • SS: Number of credit hours earned in social science courses in high school
    • FirstGen: 1= student is the first in her or his family to attend college, 0=otherwise
    • White: 1= white students, 0= others
    • CollegeBound: 1=attended a high school where >=50% students intended to go on to college, 0=otherwise

3 Tasks

Question 1: How do demographic variables (Male, FirstGen and White) predict first year college GPA (GPA)?

Question 2: How does HS GPA (HSGPA) predict first year college GPA (GPA) over demographic variables (Male, FirstGen and White)?

  1. Run the two models above.
library(car)
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
The following object is masked from 'package:purrr':

    some
library(Stat2Data)
data(FirstYearGPA)
head(FirstYearGPA)
   GPA HSGPA SATV SATM Male   HU   SS FirstGen White CollegeBound
1 3.06  3.83  680  770    1  3.0  9.0        1     1            1
2 4.15  4.00  740  720    0  9.0  3.0        0     1            1
3 3.41  3.70  640  570    0 16.0 13.0        0     0            1
4 3.21  3.51  740  700    0 22.0  0.0        0     1            1
5 3.48  3.83  610  610    0 30.5  1.5        0     1            1
6 2.95  3.25  600  570    0 18.0  3.0        0     1            1
m1 <- lm(data = FirstYearGPA, GPA ~ Male + FirstGen + White)
summary(m1)

Call:
lm(formula = GPA ~ Male + FirstGen + White, data = FirstYearGPA)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0622 -0.3192  0.0100  0.3658  1.1074 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.86850    0.07516  38.168  < 2e-16 ***
Male         0.02414    0.06091   0.396 0.692303    
FirstGen    -0.13630    0.09807  -1.390 0.166045    
White        0.29366    0.07657   3.835 0.000165 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4475 on 215 degrees of freedom
Multiple R-squared:  0.08856,   Adjusted R-squared:  0.07584 
F-statistic: 6.963 on 3 and 215 DF,  p-value: 0.0001711
vif(m1)
    Male FirstGen    White 
1.009554 1.063783 1.063923 
vcov(m1)
             (Intercept)          Male     FirstGen         White
(Intercept)  0.005648282 -0.0015445844 -0.002648605 -0.0046992823
Male        -0.001544584  0.0037096086  0.000356494 -0.0002834014
FirstGen    -0.002648605  0.0003564940  0.009618244  0.0017527512
White       -0.004699282 -0.0002834014  0.001752751  0.0058626068
m2 <- lm(data = FirstYearGPA, GPA ~ Male + FirstGen + White + HSGPA)
summary(m2)

Call:
lm(formula = GPA ~ Male + FirstGen + White + HSGPA, data = FirstYearGPA)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.05776 -0.25735  0.04178  0.28247  0.74665 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.95274    0.25597   3.722 0.000253 ***
Male         0.06194    0.05416   1.144 0.254080    
FirstGen    -0.18508    0.08709  -2.125 0.034719 *  
White        0.25738    0.06798   3.786 0.000199 ***
HSGPA        0.55967    0.07220   7.751 3.63e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3963 on 214 degrees of freedom
Multiple R-squared:  0.2884,    Adjusted R-squared:  0.2751 
F-statistic: 21.68 on 4 and 214 DF,  p-value: 4.96e-15
vif(m2)
    Male FirstGen    White    HSGPA 
1.017807 1.069368 1.068992 1.016413 
vcov(m2)
              (Intercept)          Male      FirstGen         White
(Intercept)  0.0655185804 -0.0024171368 -0.0005220844 -0.0025294273
Male        -0.0024171368  0.0029337643  0.0002489512 -0.0002451420
FirstGen    -0.0005220844  0.0002489512  0.0075845739  0.0014043929
White       -0.0025294273 -0.0002451420  0.0014043929  0.0046207882
HSGPA       -0.0178460690  0.0003521719 -0.0004544480 -0.0003379714
                    HSGPA
(Intercept) -0.0178460690
Male         0.0003521719
FirstGen    -0.0004544480
White       -0.0003379714
HSGPA        0.0052135130
  • Mean center HSGPA to make intercept more interpretable
FirstYearGPA <- FirstYearGPA %>%
  mutate(HSGPA_C = HSGPA - mean(HSGPA, na.rm = TRUE))
mean(FirstYearGPA$HSGPA, na.rm = TRUE)
[1] 3.45274
m2a <- lm(data = FirstYearGPA, GPA ~ Male + FirstGen + White + HSGPA_C)
summary(m2a)

Call:
lm(formula = GPA ~ Male + FirstGen + White + HSGPA_C, data = FirstYearGPA)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.05776 -0.25735  0.04178  0.28247  0.74665 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.88512    0.06660  43.321  < 2e-16 ***
Male         0.06194    0.05416   1.144 0.254080    
FirstGen    -0.18508    0.08709  -2.125 0.034719 *  
White        0.25738    0.06798   3.786 0.000199 ***
HSGPA_C      0.55967    0.07220   7.751 3.63e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3963 on 214 degrees of freedom
Multiple R-squared:  0.2884,    Adjusted R-squared:  0.2751 
F-statistic: 21.68 on 4 and 214 DF,  p-value: 4.96e-15
vif(m2a)
    Male FirstGen    White  HSGPA_C 
1.017807 1.069368 1.068992 1.016413 
vcov(m2a)
              (Intercept)          Male      FirstGen         White
(Intercept)  0.0044353517 -0.0012011789 -0.0020911751 -0.0036963547
Male        -0.0012011789  0.0029337643  0.0002489512 -0.0002451420
FirstGen    -0.0020911751  0.0002489512  0.0075845739  0.0014043929
White       -0.0036963547 -0.0002451420  0.0014043929  0.0046207882
HSGPA_C      0.0001548344  0.0003521719 -0.0004544480 -0.0003379714
                  HSGPA_C
(Intercept)  0.0001548344
Male         0.0003521719
FirstGen    -0.0004544480
White       -0.0003379714
HSGPA_C      0.0052135130
  1. What are the log-likelihoods of each model? What can you say about the models based on those values? What can’t you say?
logLik(m1)
'log Lik.' -132.6243 (df=5)
logLik(m2)
'log Lik.' -105.5292 (df=6)
logLik(m2a)
'log Lik.' -105.5292 (df=6)
  • You cannot say anything about the individual likelihoods, only compare them. Likelihood (/log likelihood) are relative measures, unlike probability.
  • Model 2/2a has a higher (closer to 0) log likelihood than model 1.
  • Centering HSGPA doesn’t change the log likelihood.
  1. Compare the two models using a likelihood ratio test (LRT). Report the results of the test. What can you say about the models based on the test?
anova(m1, m2, test = "LRT")
Analysis of Variance Table

Model 1: GPA ~ Male + FirstGen + White
Model 2: GPA ~ Male + FirstGen + White + HSGPA
  Res.Df    RSS Df Sum of Sq  Pr(>Chi)    
1    215 43.051                           
2    214 33.614  1    9.4369 9.108e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-2*(logLik(m1) - logLik(m2))
'log Lik.' 54.1902 (df=5)
  • Model 2 is significantly better than model 1: \chi^2(1) = 54.1902, p<.0001.
  • Adding HSGPA to a model with Male, FirstGen, and White significantly improves the model.
  1. Report the results for the better model (based on the LRT). Include all regression coefficients, R^2, test statistics, p-values.
  • Intercept: b_0 = 2.885, t(214) = 43.321, p<.0001
    • The expected first year college GPA for a person who is female, not a first generation college student, not white, and who has mean high school GPA is 2.885.
  • Male: b_1 = 0.062, t(214) = 1.144, p>.05
    • Male students have a first year college GPA that is 0.062 points higher than female students, holding all other predictors constant. This effect is not significantly different from 0.
  • FirstGen: b_2 = -0.185, t(214) = -2.125, p<.05
    • First generation college students have a first year college GPA that is 0.185 points lower than that of non-first generation students, holding all other predictors constant; this effect is significant.
  • White: b_3 = 0.25738, t(214) = 3.786, p<.001
    • White students have a first year college GPA that is 0.257 points higher than their non-white classmates, holding all other predictors constant; this effect is significant.
  • HSGPA_C: b_4 = 0.55967, t(214) = 7.751, p<.0001
    • Each 1 unit increase in high school GPA corresponds to a 0.560 point higher first year college GPA, holding all other predictors constant. This effect is significant.
  • R^2_{multiple} = 0.2884, F(4, 214) = 21.68, p<.0001
    • The four predictors in this model account for 28.84% of the variance in the outcome, first year college GPA.